Small but Powerful: A Deep Dive into Small Language Models (SLMs)

Rosemary J Thomas, PhD · Published in Version 1 · Dec 13, 2023

Large Language Models (LLMs) have been popular for quite some time now. Lately, Small Language Models (SLMs) have enhanced our capacity to handle and communicate with various natural and programming languages. However, some user queries require more accuracy and domain knowledge than models trained on general-purpose language can offer. There is also demand for custom Small Language Models that can match the performance of LLMs while lowering runtime expenses and ensuring a secure, fully manageable environment.

In this article, we explore Small Language Models, how they differ from LLMs, reasons to use them, and their applications. We also apply fine-tuning methods to Llama-2-13b, a Small Language Model, to address the issues mentioned above.

Furthermore, our goal is to examine the possibility of making this process platform-independent. For this purpose, we selected Databricks as a platform that can be transferred among Azure, Amazon Web Services (AWS), and Google Cloud Platform.


In the context of artificial intelligence and natural language processing, SLM stands for 'Small Language Model': a lightweight generative AI model. The label "small" here refers to a) the size of the model's neural network, b) the number of parameters, and c) the volume of data the model is trained on. Several implementations can run on a single GPU and have over 5 billion parameters, including Google's Gemini Nano, Microsoft's Orca-2-7b and Orca-2-13b, and Meta's Llama-2-13b, among others.

There are some differences between SLMs and LLMs. First, LLMs are larger and have undergone more extensive training than SLMs. Second, LLMs have notable natural language processing abilities, making it possible to capture complicated patterns and excel at natural language tasks such as complex reasoning. Finally, LLMs can understand language more thoroughly, while SLMs have more restricted exposure to language patterns. This does not put SLMs at a disadvantage: when used in appropriate use cases, they are more beneficial than LLMs.

There are several reasons to use these models. They are gaining popularity and relevance in various applications, especially with regard to sustainability and the amount of data needed for training. From the hardware point of view, SLMs are cheaper to run: they require less computational power and memory, and they are suitable for on-premises and on-device deployments, making them more secure. From the usage point of view, SLMs can be trained or fine-tuned for particular domains or tasks, so they can master specialized vocabulary and knowledge, from legal jargon to medical diagnosis protocols, while protecting intellectual property. Depending on the scenario, SLMs can be both cheaper and more efficient.

SLMs find applications in a wide range of sectors, from healthcare to technology and beyond. Common use cases across these industries include text summarization, text generation, sentiment analysis, chatbots, named entity recognition, spelling correction, machine translation, code generation, and others.

Language model fine-tuning is the process of giving additional training to a pre-trained language model to make it more domain- or task-specific. It involves updating the model's parameters with additional training data to improve its performance in specific areas or applications, such as text generation, question answering, language translation, and sentiment analysis. We are interested in 'domain-specific fine-tuning', which is especially useful when we want the model to understand and generate text relevant to specific industries or use cases.

Hardware Requirements

The hardware requirements may vary based on the size and complexity of the model, the scale of the project, and the dataset. It is good practice to start at a small scale and then scale up as necessary. However, here are some general guidelines for fine-tuning a private language model.

  1. GPUs (Graphics Processing Units) for processing; these could be cloud-based.
  2. A fast and reliable internet connection for data transfer.
  3. A powerful multi-core CPU for data pre-processing and for managing distribution steps.
  4. Sufficient memory, and fast and ample storage.
Figure 1. The virtual machine used for our fine-tuning process.

Data Preparation

The quality and suitability of your dataset significantly impact the performance of the fine-tuned model. For our goal in this phase, we need to extract text from PDFs, clean and prepare the text, and then generate question-and-answer pairs from the resulting text chunks. Finally, we proceed with the fine-tuning process.

It should be noted that we used an LLM, GPT-3.5, to generate the Q&A pairs (which might seem to defeat the purpose here); however, depending on the use case, an SLM could be used to generate these pairs as well. A sketch of this preparation step follows Figure 2.

Figure 2. Key steps involved in preparing a dataset for fine-tuning.
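As a rough illustration, here is a minimal sketch of this preparation pipeline, assuming the pypdf library and the OpenAI Python client; the file name, prompt wording, and chunk size are illustrative assumptions, not the exact values we used.

```python
# A hedged sketch of the data-preparation pipeline: extract text from a PDF,
# split it into chunks, and ask a model to produce Q&A pairs per chunk.
# Assumes `pypdf` and `openai` are installed; names and sizes are illustrative.
from pypdf import PdfReader
from openai import OpenAI

def extract_text(pdf_path: str) -> str:
    """Extract raw text from every page of a PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text: str, chunk_size: int = 1500) -> list[str]:
    """Naive fixed-size chunking; production code would split on sentence boundaries."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(chunk: str) -> str:
    """Ask GPT-3.5 to write question-and-answer pairs grounded in one chunk."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Generate three question-and-answer pairs based only "
                       "on the following text:\n\n" + chunk,
        }],
    )
    return response.choices[0].message.content

text = extract_text("internal_docs.pdf")  # hypothetical file name
qa_pairs = [generate_qa_pairs(c) for c in chunk_text(text)]
```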

Fine-tuning Process

We used the HuggingFace ecosystem and integrated its full suite of components to accomplish this task.

Figure 3. Components integrated for fine-tuning.

The pre-trained Llama-2-13b-chat-hf model was chosen. We converted the domain-specific dataset into the HuggingFace datasets format and used the tokenizer available through the HuggingFace API. In addition, quantization was used to reduce the precision of numerical values in the model, enabling data compression, more efficient computation and storage, and noise reduction. A parameter-efficient fine-tuning (PEFT) configuration was also enabled for efficient adaptation of the pre-trained model. Finally, training arguments were used to define the particulars of the training process, and the trainer was passed the parameters, data, and constraints.
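To show how these components fit together, below is a minimal sketch of a quantized, parameter-efficient fine-tuning setup, assuming the transformers, peft, trl, datasets, and bitsandbytes libraries (trl's pre-0.12 SFTTrainer API); the LoRA hyperparameters, batch size, learning rate, and data file are illustrative assumptions, not our exact configuration.

```python
# A hedged sketch of quantized, parameter-efficient fine-tuning with
# HuggingFace components. Hyperparameters and paths are illustrative.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer  # trl < 0.12; newer versions move args into SFTConfig

model_name = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit quantization so the 13B model fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# PEFT: train small LoRA adapter matrices instead of all 13B weights.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

# Hypothetical JSON file holding the generated Q&A pairs as "text" records.
dataset = load_dataset("json", data_files="qa_pairs.json", split="train")

training_args = TrainingArguments(
    output_dir="./llama2-13b-domain",
    num_train_epochs=50,            # the article reports 50 epochs
    per_device_train_batch_size=4,  # illustrative; tune to GPU memory
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",      # column containing the formatted Q&A text
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
```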

Training Process

We conducted a 50-epoch fine-tuning of the model; an epoch is one full pass through the training dataset. The run took about 16 hours to complete, and our CPU and RAM resources were not fully utilized during the process, so a machine with more limited CPU and RAM might also suit the process. Our GPU usage aligned with the stated model requirements; increasing the batch size could perhaps accelerate training.

Figure 4. CPU and RAM usage.

Overall, despite the initial challenges of understanding the interconnections and several unsuccessful attempts, the fine-tuning process ran smoothly and consistently. The monetary cost of the final fine-tuning run was around $100/£83. However, this figure does not include the cost of all the trial and error that led to the final fine-tuning process.

Figure 5. Cost of fine-tuning in GBP.

Results and Observations

Please note that we used GPT-3.5 to generate the questions and answers from the training data. The model we fine-tuned, Llama-2-13b-chat-hf, has only 13 billion parameters, while GPT-3.5 has 175 billion. In other words, we are expecting a small model to perform as well as a large one. Given this difference in scale, a direct comparison between individual answers is not appropriate; however, the answers should still be comparable in quality.

Embeddings were created for the answers generated by the SLM and by GPT-3.5, and cosine similarity was used to determine how close the answers from the two models were.
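Here is a minimal sketch of this comparison, assuming the sentence-transformers library; the embedding model and the answer lists are placeholders, not necessarily what we used.

```python
# A hedged sketch of the answer-similarity check. The embedding model and
# inputs are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

slm_answers = ["..."]   # answers from the fine-tuned Llama-2-13b-chat-hf
gpt_answers = ["..."]   # reference answers from GPT-3.5

slm_emb = encoder.encode(slm_answers, convert_to_tensor=True)
gpt_emb = encoder.encode(gpt_answers, convert_to_tensor=True)

# Cosine similarity ranges from -1 (opposite) through 0 (unrelated) to 1 (identical).
scores = util.cos_sim(slm_emb, gpt_emb).diagonal()
acceptable = (scores >= 0.5).sum().item()  # 0.5 cut-off used in the article
print(f"{acceptable}/{len(slm_answers)} answers at or above the 0.5 threshold")
```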

Figure 6. Similarity distribution of GPT-3.5 answers and Llama-2–13b-chat-hf answers.

According to Figure 6, 0.5 was established as the quality cut-off, and 0.6 was the average similarity of the answers produced by Llama-2-13b-chat-hf. Anything above 0.5 was considered acceptable, and anything below, unacceptable. This is a reasonable threshold because cosine similarity ranges from -1 (opposite) through 0 (unrelated) to 1 (an exact match), so a value of 0.5 indicates meaningful alignment between the answers.

For the fine-tuning process, we used about 10,000 question-and-answer pairs generated from Version 1's internal documentation. For evaluation, however, we selected only questions relevant to Version 1 and the process. Further analysis showed that over 70% of the answers were strongly similar to those generated by GPT-3.5, that is, with a similarity of 0.5 and above (see Figure 6). In total, there were 605 acceptable answers, 118 somewhat acceptable answers (between 0.4 and 0.5), and 12 unacceptable answers (below 0.4).

The fine-tuned model seems to be competent at extracting and retaining knowledge while demonstrating the ability to generate answers specific to the domain. A platform-agnostic approach allowed us to execute the same fine-tuning process on AWS and achieve almost identical results without any changes to the code.

Conclusions

There are some drawbacks to SLMs. A single constantly running instance of this system costs approximately $3,700/£3,000 per month. Their knowledge bases are more limited than those of their LLM counterparts, so a fine-tuned SLM cannot answer general factual questions such as who walked on the moon. Due to its narrower understanding of language and context, it can also produce more restricted and limited answers. Still, the future looks bright for SLMs on their own merits. The voyage of language models highlights a fundamental message in AI: small can be impressive, provided there is constant advancement and modernization. In addition, efficiency, versatility, environmental friendliness, and optimized training approaches are what unlock the potential of SLMs.

We will have to wait and see how popular SLMs become compared to their LLM counterparts, especially with the recent launch of SLMs such as Gemini Nano, Mixtral, Phi-2, and others.

Note to the reader:

Alexander Suvorov, our Senior Data Scientist, conducted the fine-tuning processes of Llama 2.

About the Author:

Rosemary J Thomas, PhD, is a Senior Technical Researcher at the Version 1 AI Labs.
